R is the leading programming language for ecological modelling for good reason. Being free and open-source certainly helps. Having strengths in dataviz also helps. And because of these traits, R now has a huge ecosystem of user-contributed packages that enable sophisticated modelling applications.
This ecosystem of R packages is created by a huge community of R users, many of whom are leaders in their fields, developing cutting-edge statistical models and data science tools. This means if you know R, you can access these tools and combine them in new ways.
While there are other languages that excel at scientific computing, R owns the market in species distribution models (SDMs), the topic of this course.
Until quite recently most R users would prepare their data outside of R (e.g. in ArcGIS) and then read it into R for the SDM. But R now also has efficient and user-friendly GIS and mapping packages. This means you can create your entire SDM workflow, from data download to visualization, in R.
But starting an SDM project in R can be daunting for new users. There is a steep learning curve for coding, and there are so many options for packages it is easy to get decision paralysis.
So in this course we are going to provide an introduction to some common SDMs in R. We will also learn how to build an efficient workflow. Our goals today are:
Overview R’s capability for GIS and spatial dataviz
Learn how to build efficient and repeatable workflows for SDMs
Learn how to run some SDMs
Learn how to visualize spatial data and SDM results
This course is suitable for people who have some R experience. It is not a beginner’s course, but users with a little R experience can follow along with the first few sections.
In this course we’ll overview:
GIS in R with shapefiles and rasters (using the modern packages sf and terra)
Generalized linear models
Generalized least squares and models of spatial autocorrelation
Generalized additive models
Spatial prediction and interpolation
Make sure you have a recent version of R (available at the CRAN website) and RStudio installed (the free desktop version is fine).
You will need to install these packages:
install.packages(c("tmap", "tidyverse", "terra",
"sf", "nlme",
"patchwork", "visreg"))
Bumphead parrotfish (Bolbometopon muricatum) are an enigmatic tropical fish species. Adults of this species are characterized by a large bump on their forehead that males use to display and fight during breeding. Sex determination for this species is unknown, but it is likely that an individual has the potential to develop into either a male or female at maturity.
Adults travel in schools and consume algae by biting off chunks of coral and in the process they literally poo out clean sand. Because of their large size, schooling habit and late age at maturity they are susceptible to overfishing, and many populations are in decline.
Their lifecycle is characterized by migration from lagoonal reef as juveniles (see image below) to reef flat and exposed reef habitats as adults. Early stage juveniles are carnivorous and feed on zooplankton, and then transform into herbivores at a young age.
Image: Lifecycle of bumphead parrotfish. Image by E. Stump and sourced from Hamilton et al. 2017.
Until the mid 2010s the habitat for settling postlarvae and juveniles was a mystery. However, the pattern of migrating from inshore to offshore over their known lifecycle suggests that the earliest benthic lifestages (‘recruits’) may occur on nearshore reef habitats.
Nearshore reef habitats are susceptible to degradation from poor water quality, raising concerns that this species may also be in decline because of pollution. But the gap in data from the earliest lifestages hinders further exploration of this issue.
In this course we’ll be analyzing the first survey that revealed the habitat preferences of early juvenile stages of bumphead parrotfish. These data were analyzed by Hamilton et al. 2017 and Brown and Hamilton 2018.
In the 2010s Rick Hamilton (The Nature Conservancy) led a series of surveys in the nearshore reef habitats of Kia province, Solomon Islands. The aim was to look for the recruitment habitat of juvenile bumphead parrotfish. These surveys were motivated by concern from local communities in Kia that topa (the local name for bumpheads) are in decline.
In the surveys, divers swam standardized transects and searched for juvenile bumphead in nearshore habitats, often along the edge of mangroves. Altogether they surveyed 49 sites across Kia.
These surveys were made all the more challenging by the occurrence of crocodiles in mangrove habitat in the region. So these data are incredibly valuable.
Logging in the Kia region has caused water quality issues that may impact nearshore coral habitats. During logging, logs are transported from the land onto barges at ‘log ponds’. A log pond is an area of mangroves that is bulldozed to enable transfer of logs to barges. As you can imagine, logponds are very muddy. This damage creates significant sediment runoff which can smother and kill coral habitats.
Rick and the team surveyed reefs near logponds and in areas that had no logging. They only ever found bumphead recruits hiding in branching coral species.
In this course we will first ask if the occurrence of bumphead recruits is related to the cover of branching coral species. We will then develop an SDM to analyse the relationship between pollution from logponds and bumphead recruits, and use this SDM to predict pollution impacts to bumpheads across the Kia region.
The data and code for the original analyses are available at my github site. In this course we will use simplified versions of the original data.
An important part of R coding is having an organized workflow. Being organized is important in anything more complex than a one-step R program, which will be most projects involving SDMs. Organizing your workflow requires forward planning and sticking to some strategies. In this course we’ll follow a simple workflow.
There are multiple benefits of an organized workflow. Your code will be more transparent and repeatable, which will benefit both future you and other researchers. It also means you are less likely to make mistakes, and the code is easier to debug when you do. Finally, you may want to share the code publicly so other people can repeat your work.
First, you’ll want to identify your research question. If your question is clear then you can stick to what you need to do, and not end up going down rabbit holes creating code you won’t use. Once you have that question I recommend considering five steps in your workflow:
You will need to source data, often from online data repositories. Even if you have collected species observations yourself, you will likely need to get environmental covariates for prediction from sources such as weather bureaus, oceanographic repositories or climate models.
The data needs to be read into R and ‘wrangled’ into the appropriate data structure for the modelling packages you will use. Just a warning, some packages require different data structures, so make sure you know what you want! For a spatial model this step can involve a lot of (complex) GIS. Thankfully, R has good packages for GIS.
Before starting I always use R’s powerful data visualisation tools to explore the data. This gives you a deeper understanding of what you’ll model and can help you avoid conceptual flaws. Also, you may want to save some data plots for your publication. In this course we’ll use R markdown to make a report on our data that can be easily shared with collaborators.
Finally, we get to the modelling. This is where we’ll spend most of our time in this course.
A powerful way to communicate your models is by making more dataviz. In this course we’ll use dataviz to look at what the models say are important environmental drivers of fish abundance, and see how we can use the model to make predictions at unsampled sites.
One final note: the above can be an iterative process. But if you organize your R scripts around these steps, iterating will be much easier.
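As a concrete sketch, here is one way you could lay out the scripts for a project following these steps (the file names are just a suggestion, not a requirement):

```r
# A suggested script layout matching the five workflow steps:
# 01_download_data.R    # source data from online repositories
# 02_wrangle_data.R     # read data into R and wrangle into model-ready form
# 03_explore_data.R     # exploratory dataviz and a shareable data report
# 04_fit_models.R       # fit and compare the SDMs
# 05_communicate.R      # prediction maps and publication figures
```

Numbering the scripts makes the intended order obvious to collaborators (and to future you).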
So that’s all the background, let’s get started. We’ll work through each step of the above workflow.
You should have the data for this course already; if not, download it from GitHub.
Other useful sources of spatial data include online repositories, the rnaturalearth package and satellite data products.
library(readr)
library(ggplot2)
library(dplyr)
dat <- read_csv("data/JuvUVCSites_with_ReefTypes_16Jun2016.csv")
names(dat)
## [1] "site" "Reef.ID" "pres.topa" "pres.habili" "secchi"
## [6] "flow" "logged" "coordx" "coordy"
There are coordinates, but because this is a plain dataframe there is no CRS (coordinate reference system)! The CRS is something to make sure you ask data providers for (or ask them to provide the data in spatial file formats that store the CRS).
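As a preview, once we know the CRS (here it turns out to be WGS 84 / UTM zone 57S, i.e. EPSG:32757, matching the shapefiles we load below), we can attach it to the coordinates. A sketch of what we’ll do properly later:

```r
library(sf)

# Convert the plain dataframe to an sf object with a known CRS
dat_sf <- st_as_sf(dat, coords = c("coordx", "coordy"), crs = 32757)
st_crs(dat_sf)  # now prints the CRS instead of being missing
```

Without this step R has no way of knowing what the coordinate numbers mean.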
ggplot(dat) +
aes(x = secchi, y = pres.topa) +
geom_point() +
stat_smooth()
ggplot(dat) +
aes(x = logged, y = pres.topa) +
geom_boxplot()
theme_set(theme_classic())
bendat <- read_csv("data/BenthicCoverSurveys.csv")
head(bendat)
## # A tibble: 6 x 5
## site code cover n_pts category
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 AA 0 375 Algal Assemblage
## 2 1 ACB 14 375 Acropora Branching
## 3 1 ACBD 0 375 Acropora Branching Dead
## 4 1 ACD 2 375 Acropora Digitate
## 5 1 ACE 4 375 Acropora Encrusting
## 6 1 ACS 62 375 Acropora Submassive
ggplot(bendat) +
aes(x = category,y = cover) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90))
library(sf)
logponds <- st_read("data/Kia_Logging_Ponds/Kia_Logging_Ponds.shp")
## Reading layer `Kia_Logging_Ponds' from data source `/Users/s2973410/Documents/Code/teaching/SDM-fish-course-2021/SDM-fish-course-notes/data/Kia_Logging_Ponds/Kia_Logging_Ponds.shp' using driver `ESRI Shapefile'
## Simple feature collection with 25 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 425584 ymin: 9143367 xmax: 455466.5 ymax: 9163796
## Projected CRS: WGS 84 / UTM zone 57S
plot(logponds[1])
land <- st_read("data/LandPoly/LandPoly.shp")
## Reading layer `LandPoly' from data source `/Users/s2973410/Documents/Code/teaching/SDM-fish-course-2021/SDM-fish-course-notes/data/LandPoly/LandPoly.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 409995 ymin: 9142285 xmax: 455925 ymax: 9183985
## Projected CRS: WGS 84 / UTM zone 57S
plot(land[1])
CB_dat <- filter(bendat, code == "ACB") # just Acropora branching
CB_dat <- filter(bendat, code %in% c("ACB", "CB")) # all branching coral codes
CB_dat <- group_by(CB_dat, site)
CB_dat <- summarize(CB_dat, CB_cover = sum(cover),
n_pts = mean(n_pts))
This whole thing could be more straightforward with pipes:
soft_dat <- bendat %>%
filter(code %in% c("S", "SI")) %>% #Sand or Silt
group_by(site) %>%
summarize(soft_cover = sum(cover),
n_pts = mean(n_pts))
nrow(dat)
## [1] 49
dat2 <- left_join(dat, CB_dat, by = "site")
nrow(dat2)
## [1] 49
dat2 <- left_join(dat2, soft_dat, by = c("site", "n_pts"))
nrow(dat2)
## [1] 49
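The row counts haven’t changed, which is reassuring. As a further check, dplyr’s anti_join returns the rows that failed to match (a convenience sketch):

```r
library(dplyr)

# Sites in dat with no match in the benthic data; if all sites matched,
# this returns a dataframe with zero rows
anti_join(dat, CB_dat, by = "site")
```

This is a handy habit whenever you join datasets from different sources.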
ggplot(dat2) +
aes(x = logged, y = CB_cover) +
geom_boxplot()
ggplot(dat2) +
aes(x = CB_cover, y = pres.topa) +
geom_point()
ggplot(dat2) +
aes(x = CB_cover, y = soft_cover) +
geom_point()
icol <- sapply(dat2, is.numeric)
pairs(dat2[,icol])
round(cor(dat2[,icol]),2)
kia_crs <- st_crs(logponds)
sdat2 <- st_as_sf(dat2, coords = c("coordx", "coordy"),
crs = kia_crs)
plot(sdat2["pres.topa"])
st_crs(land) == kia_crs
## [1] FALSE
land <- st_transform(land, kia_crs)
st_crs(land) == kia_crs
## [1] TRUE
distmat <- st_distance(sdat2, logponds)
dim(distmat)
## [1] 49 25
apply(distmat, 1, min)[1:5]
## [1] 3396.8083 2440.7593 2432.2577 195.5112 358.5544
sdat2$dist_to_logging <- apply(distmat, 1, min)/1000
ggplot(sdat2) +
aes(x = dist_to_logging, y = pres.topa) +
geom_point()
library(patchwork)
g1 <- ggplot(sdat2) +
aes(x = dist_to_logging, y = secchi) +
geom_point() +
stat_smooth()
g1
g2 <- ggplot(sdat2) +
aes(x = dist_to_logging, y = CB_cover) +
geom_point() +
stat_smooth()
g3 <- ggplot(sdat2) +
aes(x = CB_cover, y = pres.topa) +
geom_point() +
stat_smooth()
gall <- g1 + g2 + g3
gall
ggsave("plot1.png", gall, width = 8, height = 3)
library(tmap)
tm_shape(sdat2) +
tm_symbols(col = "pres.topa", size = 0.2)
tland <- tm_shape(land) +
tm_fill()
tland +
tm_shape(sdat2) +
tm_symbols(col = "pres.topa", size = 0.2) +
tm_scale_bar(position = c("right", "top"))
Preparation
For the model covariate we’ll use distance to log ponds, rather than branching coral cover, as our proxy for pollution exposure.
We’re going to start with simpler models and work our way up. In a real workflow you might do things differently.
We’ll focus on model-based approaches. Alternatives include machine learning methods such as boosted regression trees, MaxEnt and random forests. We use model-based approaches here because they let us account explicitly for spatial autocorrelation.
sdat2$log10_dist_logging <- log10(sdat2$dist_to_logging)
library(visreg)
glm1_gaus <- glm(pres.topa ~ log10_dist_logging + flow, data = sdat2)
par(mfrow = c(2,2))
plot(glm1_gaus)
glm1_pois <- glm(pres.topa ~ log10_dist_logging + flow, data = sdat2,
family = "poisson")
par(mfrow = c(2,2))
plot(glm1_pois)
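Before moving on, we can check the Poisson model for overdispersion; a ratio much greater than 1 indicates the variance exceeds the Poisson assumption, motivating the negative binomial model below. A quick sketch:

```r
# Pearson chi-square statistic divided by residual degrees of freedom;
# values well above 1 indicate overdispersion
sum(resid(glm1_pois, type = "pearson")^2) / df.residual(glm1_pois)
```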
library(MASS)
glm1_nb <- glm.nb(pres.topa ~ log10_dist_logging + flow, data = sdat2)
par(mfrow = c(2,2))
plot(glm1_nb)
AIC(glm1_gaus, glm1_pois, glm1_nb)
## df AIC
## glm1_gaus 4 323.9075
## glm1_pois 3 331.4446
## glm1_nb 4 184.8777
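To make the comparison easier to read, we can compute AIC differences from the best model (a small convenience sketch):

```r
# Table of AIC differences; the best model has dAIC = 0
aic_tab <- AIC(glm1_gaus, glm1_pois, glm1_nb)
aic_tab$dAIC <- aic_tab$AIC - min(aic_tab$AIC)
aic_tab[order(aic_tab$dAIC), ]
```

The negative binomial model is clearly favoured here.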
summary(glm1_nb)
##
## Call:
## glm.nb(formula = pres.topa ~ log10_dist_logging + flow, data = sdat2,
## init.theta = 0.5728221308, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86220 -0.77891 -0.29183 -0.03023 2.52082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1805 0.9925 -3.205 0.00135 **
## log10_dist_logging 4.1679 0.9167 4.547 5.45e-06 ***
## flowStrong 2.8985 0.7392 3.921 8.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.5728) family taken to be 1)
##
## Null deviance: 78.925 on 48 degrees of freedom
## Residual deviance: 39.362 on 46 degrees of freedom
## AIC: 184.88
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.573
## Std. Err.: 0.176
##
## 2 x log-likelihood: -176.878
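Because the negative binomial model uses a log link, exponentiating the coefficients converts them to multiplicative effects on expected abundance (a sketch):

```r
# e.g. exp of the flowStrong coefficient gives the multiplicative
# increase in expected topa abundance at strong-flow sites
exp(coef(glm1_nb))
```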
sdat2$resid_glm_pois <- resid(glm1_pois)
tm_shape(land) +
tm_polygons() +
tm_shape(sdat2) +
tm_symbols(col = "resid_glm_pois", size = 0.5)
source("data/semivariance.R")
site_distmat <- st_distance(sdat2)/1000
dim(site_distmat)
## [1] 49 49
glm1_pois_semivar <- semivariance(site_distmat, sdat2$resid_glm_pois, ncats = 15)
ggplot(glm1_pois_semivar) +
aes(x = distances, y = semivar) +
geom_point() +
stat_smooth()
glm_nb_plots <- plot_spatial_AC(sdat2, glm1_nb, site_distmat)
tland + glm_nb_plots[[1]]
glm_nb_plots[[2]]
library(nlme)
sdat2$sqrt_topa <- sqrt(sdat2$pres.topa)
m1_gls <- gls(sqrt_topa ~ log10_dist_logging + flow,
data = sdat2)
sdat2$x <- st_coordinates(sdat2)[,1]
sdat2$y <- st_coordinates(sdat2)[,2]
cs1Sph <- corSpher(1, form = ~x + y)
m2_gls <- gls(sqrt_topa ~ log10_dist_logging + flow,
data = sdat2,
correlation = cs1Sph)
plot(m2_gls)
summary(m2_gls)
## Generalized least squares fit by REML
## Model: sqrt_topa ~ log10_dist_logging + flow
## Data: sdat2
## AIC BIC logLik
## 163.9018 173.045 -76.95088
##
## Correlation Structure: Spherical spatial correlation
## Formula: ~x + y
## Parameter estimate(s):
## range
## 2010.678
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.1029171 0.2952781 0.348543 0.7290
## log10_dist_logging 1.1372396 0.3456217 3.290417 0.0019
## flowStrong 1.6198332 0.5677802 2.852923 0.0065
##
## Correlation:
## (Intr) lg10__
## log10_dist_logging -0.712
## flowStrong -0.304 0.067
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.32067681 -0.56075680 -0.09694094 0.34992999 3.30765124
##
## Residual standard error: 1.235041
## Degrees of freedom: 49 total; 46 residual
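The spherical structure is just one choice; nlme offers several others (corExp, corGaus, corRatio and more), and you could compare them by AIC. A sketch with an exponential structure (same fixed effects across models, so the REML AICs are comparable):

```r
library(nlme)

# Exponential spatial correlation as an alternative to spherical
cs1Exp <- corExp(1, form = ~x + y)
m3_gls <- gls(sqrt_topa ~ log10_dist_logging + flow,
              data = sdat2,
              correlation = cs1Exp)
AIC(m1_gls, m2_gls, m3_gls)
```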
gls_plots <- plot_spatial_AC(sdat2, m2_gls, site_distmat)
gls_plots[[2]]
library(mgcv)
m1_gam <- gam(pres.topa ~ s(log10_dist_logging) + flow,
family = "nb",
data = sdat2)
visreg(m1_gam)
Compare this to the model where distance isn’t logged!
m1_gam_plots <- plot_spatial_AC(sdat2, m1_gam, site_distmat)
tland + m1_gam_plots[[1]]
m1_gam_plots[[2]]
m2_gam <- gam(pres.topa ~ s(x, y, bs = "gp"),
family = "nb",
data = sdat2)
plot(m2_gam, se = FALSE)
m2_gam_plots <- plot_spatial_AC(sdat2, m2_gam, site_distmat)
tland + m2_gam_plots[[1]]
m2_gam_plots[[2]]
m3_gam <- gam(pres.topa ~ s(log10_dist_logging) +
s(x, y, bs = "gp"),
family = "poisson",
data = sdat2)
plot(m3_gam)
This model shows no remaining spatial autocorrelation. Try comparing it with m1_gam and m2_gam.
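One way to compare the GAMs is by AIC (a sketch; note that m3_gam uses a Poisson family while the others use the negative binomial, so treat its comparison with some caution):

```r
# Compare the three GAM fits
AIC(m1_gam, m2_gam, m3_gam)
```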
g1 <- visreg(m1_gam, xvar = "log10_dist_logging",
scale = "response", gg = TRUE) +
xlab("Distance to log ponds (log10)") +
ylab("Topa abundance")
g1
The plot shows the mean prediction plus its 95% confidence interval.
flow_pred <- visreg(m1_gam, xvar = "flow",
scale = "response", gg = TRUE, plot = FALSE)
g2 <- ggplot(flow_pred$fit) +
aes(x = flow, y = visregFit) +
geom_point() +
geom_linerange(aes(ymin = visregLwr, ymax = visregUpr)) +
xlab("Flow") +
ylab("Topa abundance")
gboth <- g1 + g2 +
plot_layout(nrow = 1, widths = c(1, 0.5)) +
plot_annotation(tag_levels = "A")
gboth
ggsave("m1_gam_predictions.png", plot = gboth,
width = 6,
height = 3)
If we want to map our predictions for fish abundance, we will need to use rasters.
If we wanted to generate predictions at the original sample sites, we could just say:
sdat2$gam1_pred <- predict(m1_gam, type = "response")
And then plot the predictions with tmap:
tland +
tm_shape(sdat2) +
tm_symbols(col = "gam1_pred", size = 0.3)
This looks ok, but we might like to extend the edges of the points to fill out the map a bit more. There are a few ways to do this; we will use rasters.
Now we do some trickery to get predictions for the raster grid.
We need to set up a dataframe that has the (log) distance values from the raster, along with their cell numbers and coordinates. Cells are numbered from 1 to the total number of cells, starting at the top-left cell.
There are two options for choosing cells. The first is more conservative and just chooses the cells that have samples. The second uses all cells with finite distance values (we’ll do this one now):
library(terra)
land_terra <- vect(land)
plot(land_terra)
kia_crs_terra <- crs(land_terra)
rlogponds <- rast(extent = ext(land_terra), crs = crs(land_terra),
res = 500)
xy <- st_coordinates(logponds)
icell <- cellFromXY(rlogponds, xy)
rlogponds[icell] <- 1
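An equivalent way to do this is terra’s rasterize function, which fills the cells containing each point with a given value (a sketch):

```r
library(terra)

# Convert the sf points to a terra SpatVector, then rasterize them
logponds_terra <- vect(logponds)
rlogponds2 <- rasterize(logponds_terra, rlogponds, field = 1)
```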
tland +
tm_shape(rlogponds) +
tm_raster(palette = "Dark2")
rdist <- distance(rlogponds)
rdist_log10 <- rast(rdist)
rdist_log10[] <- log10(rdist[]/1000)
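Equivalently, terra’s app function applies a function to every cell of a raster (a sketch):

```r
library(terra)

# Convert metres to km and take log10, cell by cell
rdist_log10 <- app(rdist, function(x) log10(x / 1000))
```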
tm_shape(rdist_log10) +
tm_raster(palette = "-YlOrBr", n = 7) +
tland
icell <- 1:ncell(rdist_log10)
pred <- data.frame(log10_dist_logging = rdist_log10[icell][,1],
cells = icell,
x = xFromCell(rdist_log10, icell),
y = yFromCell(rdist_log10, icell),
flow = "Mild")
pred <- filter(pred, is.finite(log10_dist_logging))
head(pred)
## log10_dist_logging cells x y flow
## 1 1.403175 1 410245 9184035 Mild
## 2 1.397940 2 410745 9184035 Mild
## 3 1.392754 3 411245 9184035 Mild
## 4 1.387623 4 411745 9184035 Mild
## 5 1.382555 5 412245 9184035 Mild
## 6 1.377556 6 412745 9184035 Mild
We used filter with is.finite to get rid of non-finite distance values (log10 of zero distance at the log pond cells themselves).
Now we can use predict to predict topa abundance from m1_gam, but with our new distance and coordinate values, using the newdata argument.
pred$topa_pred <- predict(m1_gam, newdata = pred, type = "response")
We chose the response type so that the predictions are in units of topa abundance, not log abundance (the negative binomial model uses a log link).
Now we just assign the predictions back to an empty raster. The empty raster is just a copy of rdist_log10 with no data.
rpred <- rast(rdist_log10)
rpred[pred$cells] <- matrix(pred$topa_pred, ncol = 1)
We specify rpred[pred$cells] so it only adds in values for the cells we predicted to.
Finally use your tmap skills to make a map Prof Calanoid will love:
tm_shape(rpred) +
tm_raster(palette = "Blues",
title= "Predicted abundance", alpha = 0.8, n=10) +
tm_shape(land) +
tm_fill(col = "black") +
tm_shape(logponds) +
tm_symbols(col = "brown", size = 0.3) +
tm_layout(bg.color = "grey20",
legend.position = c("LEFT", "bottom"),
legend.text.color = "white",
legend.title.color = "white")
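To save a tmap map, assign it to an object and use tmap_save (a sketch; the file name is illustrative):

```r
# A simplified version of the map above, saved to file
topa_map <- tm_shape(rpred) +
  tm_raster(palette = "Blues", title = "Predicted abundance",
            alpha = 0.8, n = 10) +
  tm_shape(land) +
  tm_fill(col = "black")
tmap_save(topa_map, "topa_predictions.png", width = 6, height = 5)
```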
It looks blocky; that’s because we predicted abundance to the underlying raster. Arguably the ‘blockiness’ is actually good in this case: these are model predictions, not real data, and the blockiness serves to emphasise that.
Have a go at doing a map for m2_gam
Further reading: the tmap package documentation and https://rspatial.org/raster/sdm/index.html
reefs <- st_read("data/Kia reefs/Reef_Strata_Kia.shp")
## Reading layer `Reef_Strata_Kia' from data source `/Users/s2973410/Documents/Code/teaching/SDM-fish-course-2021/SDM-fish-course-notes/data/Kia reefs/Reef_Strata_Kia.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 17587500 ymin: -855629.7 xmax: 17657300 ymax: -804145.2
## Projected CRS: World_Cylindrical_Equal_Area
names(reefs)
## [1] "OBJECTID" "L1_CODE" "L1_ATTRIB" "L2_CODE" "L2_ATTRIB"
## [6] "L3_CODE" "L3_ATTRIB" "L4_CODE" "L4_ATTRIB" "REEF"
## [11] "RB_ATTRIB" "RB_CODE" "DEPTH_CODE" "DEPTHLABEL" "RB_DEPTH_C"
## [16] "RB_DEPTH_A" "C_L3Att_L4" "Strata" "Shape_Leng" "Shape_Area"
## [21] "geometry"
st_crs(reefs) == kia_crs
## [1] FALSE
reefs <- st_transform(reefs, kia_crs)
sdat3 <- st_intersection(sdat2, reefs)
Error in CPL_geos_op2(op, x, y) : Evaluation error: TopologyException: Input geom 1 is invalid: Ring Self-intersection at or near point 439615.50008775701 9152047.0008130241 at 439615.50008775701 9152047.0008130241.
reefs2 <- st_buffer(reefs, dist = 0)
sdat3 <- st_intersection(sdat2, reefs2)
plot(sdat3["L3_ATTRIB"])
But many sites are missing from the result: points that fall just outside the reef polygons are dropped by st_intersection. One fix is to buffer the reef polygons outwards (st_buffer with a small positive distance) before intersecting.
Making a raster of land from polygons with rasterize
rland <- rast(extent = ext(land_terra), crs = crs(land_terra),
res = 100)
rland <- rasterize(land_terra, rland)
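One use for the land raster is to mask land cells out of the prediction raster. A sketch, noting the resolutions differ (rland is 100 m, rpred 500 m), so we resample first:

```r
library(terra)

# Match the land raster to the prediction grid, then blank out land cells
rland500 <- resample(rland, rpred, method = "near")
rpred_ocean <- mask(rpred, rland500, inverse = TRUE)
plot(rpred_ocean)
```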